LG – Machine Learning   CV – Computer Vision   CL – Computation and Language


1. [CV] PC²: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction
2. [LG] Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation
3. [LG] Neural Algorithmic Reasoning with Causal Regularisation
4. [CV] RealFusion: 360° Reconstruction of Any Object from a Single Image
5. [CV] Towards Universal Fake Image Detectors that Generalize Across Generative Models
[CL] ChatGPT: Jack of all trades, master of none
[LG] Infinite-Dimensional Diffusion Models for Function Spaces
[CV] Invertible Neural Skinning
[LG] Provable Copyright Protection for Generative Models

Summary: projection-conditioned point cloud diffusion for single-image 3D reconstruction; deep Transformers without shortcuts; neural algorithmic reasoning with causal regularisation; 360° reconstruction of any object from a single image; universal fake image detectors that generalize across generative models; an analysis of ChatGPT's capabilities; infinite-dimensional diffusion models for function spaces; invertible neural skinning; provable copyright protection for generative models.

1. [CV] PC²: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction

L Melas-Kyriazi, C Rupprecht, A Vedaldi
[University of Oxford]

Highlights:

  1. A point-cloud shape representation enables a highly flexible diffusion model;
  2. Projection conditioning yields high-resolution geometry that is well aligned with the input image;
  3. The probabilistic nature of the diffusion process produces multiple plausible 3D point clouds, which can be filtered to resolve the inherent ambiguity of single-view 3D reconstruction;
  4. Outperforms prior methods on synthetic benchmarks and delivers large qualitative improvements on complex real-world data.

One-sentence summary:
A single-image 3D reconstruction method based on a projection-conditioned denoising diffusion process that produces high-resolution sparse geometry, and can additionally be used to predict point colors after shape reconstruction.

Reconstructing the 3D shape of an object from a single RGB image is a long-standing and highly challenging problem in computer vision. In this paper, we propose a novel method for single-image 3D reconstruction which generates a sparse point cloud via a conditional denoising diffusion process. Our method takes as input a single RGB image along with its camera pose and gradually denoises a set of 3D points, whose positions are initially sampled randomly from a three-dimensional Gaussian distribution, into the shape of an object. The key to our method is a geometrically-consistent conditioning process which we call projection conditioning: at each step in the diffusion process, we project local image features onto the partially-denoised point cloud from the given camera pose. This projection conditioning process enables us to generate high-resolution sparse geometries that are well-aligned with the input image, and can additionally be used to predict point colors after shape reconstruction. Moreover, due to the probabilistic nature of the diffusion process, our method is naturally capable of generating multiple different shapes consistent with a single input image. In contrast to prior work, our approach not only performs well on synthetic benchmarks, but also gives large qualitative improvements on complex real-world data.
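Projection conditioning is geometrically simple: at each diffusion step, every partially-denoised point is projected into the image with the given camera pose, and the local feature of the pixel it lands on is attached to it. A minimal sketch of this lookup (the function name, shapes, and nearest-pixel sampling are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def project_features(points, features, K):
    """Attach per-pixel image features to a 3D point cloud.

    points:   (N, 3) partially-denoised points in camera coordinates
    features: (H, W, C) local image feature map
    K:        (3, 3) camera intrinsics

    Returns (N, C) per-point features; points projecting outside the
    image (or sitting behind the camera) receive zero features.
    """
    H, W, C = features.shape
    proj = points @ K.T                      # perspective projection
    uv = proj[:, :2] / proj[:, 2:3]          # pixel coordinates
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (points[:, 2] > 0)
    out = np.zeros((points.shape[0], C))
    out[inside] = features[v[inside], u[inside]]
    return out
```

The denoiser then consumes point positions together with these projected features, which is what keeps the generated geometry aligned with the input view.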

https://arxiv.org/abs/2302.10668

2. [LG] Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation

B He, J Martens, G Zhang, A Botev, A Brock, S L Smith, Y W Teh
[DeepMind & University of Oxford]

Highlights:

  1. Recent approaches have made progress towards reducing the reliance on skip connections and normalisation layers in deep neural networks;
  2. The self-attention layers in Transformers are intrinsically harder to analyse and control, yet deep vanilla Transformers can nevertheless be trained using combinations of parameter initialisation, bias matrices and location-dependent rescaling;
  3. Methods such as Exponential Signal-Preserving Attention (E-SPA), U-SPA and Value-SkipInit achieve faithful signal propagation in deep skipless Transformers; with E-SPA, such models match the performance of standard Transformers after about 5 times more iterations;
  4. This work may pave the way for new and improved deep learning architectures, and motivate further research that uses theoretical insight to improve the capabilities of deep learning.

One-sentence summary:
Deep Transformers can be trained without skip connections or normalisation layers by using new methods of controlling the attention matrices to achieve faithful signal propagation.

Skip connections and normalisation layers form two standard architectural components that are ubiquitous for the training of Deep Neural Networks (DNNs), but whose precise roles are poorly understood. Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them, using insights from wide NN kernel theory to improve signal propagation in vanilla DNNs (which we define as networks without skips or normalisation). However, these approaches are incompatible with the self-attention layers present in transformers, whose kernels are intrinsically more complicated to analyse and control. And so the question remains: is it possible to train deep vanilla transformers? We answer this question in the affirmative by designing several approaches that use combinations of parameter initialisations, bias matrices and location-dependent rescaling to achieve faithful signal propagation in vanilla transformers. Our methods address various intricacies specific to signal propagation in transformers, including the interaction with positional encoding and causal masking. In experiments on WikiText-103 and C4, our approaches enable deep transformers without normalisation to train at speeds matching their standard counterparts, and deep vanilla transformers to reach the same performance as standard ones after about 5 times more iterations.

https://arxiv.org/abs/2302.10322

3. [LG] Neural Algorithmic Reasoning with Causal Regularisation

B Bevilacqua, K Nikiforou, B Ibarz, I Bica, M Paganini, C Blundell, J Mitrovic, P Veličković
[DeepMind & Purdue University]

Highlights:

  1. The proposed self-supervised learning objective uses augmentations derived from existing hints — the algorithm's intermediate steps — to ground GNN-based algorithmic reasoners in the computations performed by the target algorithm;
  2. The causal graph is designed to capture the observation that a single step of an algorithm's execution is determined by only a subset of the input;
  3. Representations learned with the self-supervised objective are invariant to changes in the parts of the input that do not affect the computation step;
  4. The Hint-ReLIC model built on the proposed objective makes predictions of the target algorithm's outputs more robust, particularly compared with autoregressive hint prediction.

One-sentence summary:
Neural algorithmic reasoning with causal regularisation improves the out-of-distribution generalisation of reasoners, via a data augmentation procedure derived from the algorithm's intermediate computations and a self-supervised objective grounded in a causal graph.

Recent work on neural algorithmic reasoning has investigated the reasoning capabilities of neural networks, effectively demonstrating they can learn to execute classical algorithms on unseen data coming from the train distribution. However, the performance of existing neural reasoners significantly degrades on out-of-distribution (OOD) test data, where inputs have larger sizes. In this work, we make an important observation: there are many different inputs for which an algorithm will perform certain intermediate computations identically. This insight allows us to develop data augmentation procedures that, given an algorithm’s intermediate trajectory, produce inputs for which the target algorithm would have exactly the same next trajectory step. Then, we employ a causal framework to design a corresponding self-supervised objective, and we prove that it improves the OOD generalisation capabilities of the reasoner. We evaluate our method on the CLRS algorithmic reasoning benchmark, where we show up to 3× improvements on the OOD test data.
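The observation in the abstract — many different inputs on which an algorithm performs an intermediate step identically — is easy to make concrete. A minimal sketch using parallel BFS as the target algorithm (an illustrative choice; the paper works with the CLRS benchmark's algorithms and their hints):

```python
import numpy as np

def bfs_step(adj, visited):
    # One parallel BFS step: additionally mark every node adjacent
    # to an already-visited node.
    # adj: (n, n) 0/1 adjacency matrix; visited: (n,) 0/1 vector.
    return ((visited + adj @ visited) > 0).astype(int)

# Graph with a single edge 0-1; BFS has so far visited node 0 only.
adj = np.array([[0, 1, 0],
                [1, 0, 0],
                [0, 0, 0]])
visited = np.array([1, 0, 0])

# Augmentation: add an edge between two not-yet-visited nodes.
# The next BFS step is provably identical on both inputs -- exactly
# the kind of pair whose representations the self-supervised
# objective encourages to match.
adj_aug = adj.copy()
adj_aug[1, 2] = adj_aug[2, 1] = 1

assert (bfs_step(adj, visited) == bfs_step(adj_aug, visited)).all()
```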

https://arxiv.org/abs/2302.10258

4. [CV] RealFusion: 360° Reconstruction of Any Object from a Single Image

L Melas-Kyriazi, C Rupprecht, I Laina, A Vedaldi
[University of Oxford]

Highlights:

  1. RealFusion uses an off-the-shelf diffusion model and an engineered prompt to generate novel views of the object from a single image;
  2. It exploits a multi-scale radiance-field representation of the reconstructed object, combined with additional regularisers that smooth the surface;
  3. It produces plausible 3D reconstructions of objects captured in real scenes that are faithful to the input image, and demonstrates state-of-the-art reconstruction results;
  4. Future work includes designing diffusion models specifically for the novel-view synthesis task, and incorporating dynamics to reconstruct animated 3D scenes.

One-sentence summary:
RealFusion is a new method for 360° photographic reconstruction of any object from a single image, using an off-the-shelf diffusion model and a prompt that encourages it to "dream up" novel views of the object.

We consider the problem of reconstructing a full 360° photographic model of an object from a single image of it. We do so by fitting a neural radiance field to the image, but find this problem to be severely ill-posed. We thus take an off-the-shelf conditional image generator based on diffusion and engineer a prompt that encourages it to “dream up” novel views of the object. Using an approach inspired by DreamFields and DreamFusion, we fuse the given input view, the conditional prior, and other regularizers in a final, consistent reconstruction. We demonstrate state-of-the-art reconstruction results on benchmark images when compared to prior methods for monocular 3D reconstruction of objects. Qualitatively, our reconstructions provide a faithful match of the input view and a plausible extrapolation of its appearance and 3D shape, including to the side of the object not visible in the image.

https://arxiv.org/abs/2302.10663

5. [CV] Towards Universal Fake Image Detectors that Generalize Across Generative Models

U Ojha, Y Li, Y J Lee
[University of Wisconsin-Madison]

Highlights:

  1. Existing deep-learning-based methods are limited in detecting fake images from newer generative models;
  2. The limitation stems from the classifier being asymmetrically tuned to detect fake patterns, so that the real class becomes a sink class holding everything that is not fake;
  3. Real-vs-fake classification without learning is proposed, using a feature space not explicitly trained to distinguish real from fake images;
  4. Nearest-neighbour and linear probing in this space yield markedly better generalisation in detecting fake images, especially those from newer approaches such as diffusion and autoregressive models.

One-sentence summary:
Existing deep-learning-based methods fail to detect fake images from newer generative models; a simple fix is to use an informative feature space that was not trained for real-vs-fake classification.

With generative models proliferating at a rapid rate, there is a growing need for general purpose fake image detectors. In this work, we first show that the existing paradigm, which consists of training a deep network for real-vs-fake classification, fails to detect fake images from newer breeds of generative models when trained to detect GAN fake images. Upon analysis, we find that the resulting classifier is asymmetrically tuned to detect patterns that make an image fake. The real class becomes a sink class holding anything that is not fake, including generated images from models not accessible during training. Building upon this discovery, we propose to perform real-vs-fake classification without learning; i.e., using a feature space not explicitly trained to distinguish real from fake images. We use nearest neighbor and linear probing as instantiations of this idea. When given access to the feature space of a large pretrained vision-language model, the very simple baseline of nearest neighbor classification has surprisingly good generalization ability in detecting fake images from a wide variety of generative models; e.g., it improves upon the SoTA by +15.07 mAP and +25.90% acc when tested on unseen diffusion and autoregressive models.
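The nearest-neighbour instantiation is almost trivial once features exist. A sketch assuming precomputed features from a pretrained (e.g. vision-language) model; the bank names and the cosine metric are illustrative choices:

```python
import numpy as np

def nn_real_vs_fake(query, real_bank, fake_bank):
    """Label a feature vector 'real' or 'fake' by its nearest
    neighbour (cosine distance) across two feature banks. The
    feature extractor itself is never trained on this task."""
    def cos_dist(q, bank):
        q = q / np.linalg.norm(q)
        bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
        return 1.0 - bank @ q
    d_real = cos_dist(query, real_bank).min()
    d_fake = cos_dist(query, fake_bank).min()
    return "fake" if d_fake < d_real else "real"

# Toy 2-D "features": real images cluster on one axis, fakes on the other.
real_bank = np.array([[1.0, 0.0], [0.9, 0.1]])
fake_bank = np.array([[0.0, 1.0]])
print(nn_real_vs_fake(np.array([0.1, 0.9]), real_bank, fake_bank))  # prints "fake"
```

In the paper this idea operates in the feature space of a large pretrained vision-language model, where it generalises to fakes from generator families never seen when building the banks.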

https://arxiv.org/abs/2302.10174

A few more papers worth noting:

[CL] ChatGPT: Jack of all trades, master of none

J Kocoń, I Cichecki, O Kaszyca, M Kochanek, D Szydło, J Baran, J Bielaniewicz…
[Wrocław University of Science and Technology]

Highlights:

  1. ChatGPT solves most NLP problems reasonably well, but loses to SOTA models, especially on more difficult and pragmatic tasks such as emotion recognition;
  2. The ability to personalise responses via Random Contextual Few-Shot Personalization is a valuable feature;
  3. ChatGPT's distinctive self-explanation ability helps humans understand and adapt to the expected results;
  4. The results provide a basis for a fundamental discussion of whether high-quality predictive NLP models are useful to society, and of how learning and validation procedures for such systems should be established.

One-sentence summary:
ChatGPT solves most NLP problems reasonably well but loses to current State-of-the-Art (SOTA) models, with relatively larger losses on more difficult and pragmatic tasks, particularly emotion recognition; its in-context few-shot personalisation is a valuable capability, as is its distinctive self-explanation ability.

OpenAI has released the Chat Generative Pre-trained Transformer (ChatGPT) and revolutionized the approach in artificial intelligence to human-model interaction. The first contact with the chatbot reveals its ability to provide detailed and precise answers in various areas. There are several publications on ChatGPT evaluation, testing its effectiveness on well-known natural language processing (NLP) tasks. However, the existing studies are mostly non-automated and tested on a very limited scale. In this work, we examined ChatGPT’s capabilities on 25 diverse analytical NLP tasks, most of them subjective even to humans, such as sentiment analysis, emotion recognition, offensiveness and stance detection, natural language inference, word sense disambiguation, linguistic acceptability and question answering. We automated ChatGPT’s querying process and analyzed more than 38k responses. Our comparison of its results with available State-of-the-Art (SOTA) solutions showed that the average loss in quality of the ChatGPT model was about 25% for zero-shot and few-shot evaluation. We showed that the more difficult the task (lower SOTA performance), the higher the ChatGPT loss. It especially refers to pragmatic NLP problems like emotion recognition. We also tested the ability of personalizing ChatGPT responses for selected subjective tasks via Random Contextual Few-Shot Personalization, and we obtained significantly better user-based predictions. Additional qualitative analysis revealed a ChatGPT bias, most likely due to the rules imposed on human trainers by OpenAI. Our results provide the basis for a fundamental discussion of whether the high quality of recent predictive NLP models can indicate a tool’s usefulness to society and how the learning and validation procedures for such systems should be established.

https://arxiv.org/abs/2302.10724

[LG] Infinite-Dimensional Diffusion Models for Function Spaces

J Pidstrigach, Y Marzouk, S Reich, S Wang
[Universität Potsdam & MIT]

Highlights:

  1. An infinite-dimensional diffusion-model algorithm for function spaces that outperforms classical diffusion models;
  2. The algorithm enjoys dimension-free bounds on the distance from the sample measure to the target measure, and improves on previously suggested procedures for generating SDE sample paths and for conditional sampling with diffusion models;
  3. The forward and reverse SDE formulations are shown to hold in infinite dimensions, and convergence bounds for the algorithm are provided;
  4. The algorithm performs well in both unconditional and conditional sampling, and introduces a new way to perform conditional sampling in infinite-dimensional spaces.

One-sentence summary:
An infinite-dimensional diffusion-model algorithm for function spaces that improves on classical diffusion models and achieves dimension-free bounds on the distance from the sample measure to the target measure.

We define diffusion-based generative models in infinite dimensions, and apply them to the generative modeling of functions. By first formulating such models in the infinite-dimensional limit and only then discretizing, we are able to obtain a sampling algorithm that has dimension-free bounds on the distance from the sample measure to the target measure. Furthermore, we propose a new way to perform conditional sampling in an infinite-dimensional space and show that our approach outperforms previously suggested procedures.

https://arxiv.org/abs/2302.10130

[CV] Invertible Neural Skinning

Y Kant, A Siarohin, R A Guler, M Chai, J Ren, S Tulyakov, I Gilitschenski
[University of Toronto & Snap Research]

Highlights:

  1. Existing reposing methods have limited expressiveness, require costly mesh extraction, and typically do not preserve surface correspondences across different poses;
  2. Invertible Neural Skinning (INS) addresses these shortcomings with a Pose-conditioned Invertible Network (PIN) architecture that learns pose-varying deformations, combined with a differentiable Linear Blend Skinning (LBS) module;
  3. INS outperforms previous methods on clothed humans, while remaining competitive on simpler, minimally clothed bodies;
  4. INS is an order of magnitude faster than prior methods when animating long pose sequences.

One-sentence summary:
Invertible Neural Skinning (INS) is an end-to-end differentiable pipeline that outperforms state-of-the-art reposing techniques on clothed humans while preserving surface correspondences, at an order of magnitude higher speed.

Building animatable and editable models of clothed humans from raw 3D scans and poses is a challenging problem. Existing reposing methods suffer from the limited expressiveness of Linear Blend Skinning (LBS), require costly mesh extraction to generate each new pose, and typically do not preserve surface correspondences across different poses. In this work, we introduce Invertible Neural Skinning (INS) to address these shortcomings. To maintain correspondences, we propose a Pose-conditioned Invertible Network (PIN) architecture, which extends the LBS process by learning additional pose-varying deformations. Next, we combine PIN with a differentiable LBS module to build an expressive and end-to-end Invertible Neural Skinning (INS) pipeline. We demonstrate the strong performance of our method by outperforming the state-of-the-art reposing techniques on clothed humans and preserving surface correspondences, while being an order of magnitude faster. We also perform an ablation study, which shows the usefulness of our pose-conditioning formulation, and our qualitative results display that INS can rectify artefacts introduced by LBS well. See our webpage for more details: this https URL
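The invertibility at the heart of PIN can be illustrated with a generic pose-conditioned additive coupling layer, a standard invertible-network building block. This is a sketch of the general technique, not the paper's actual PIN architecture; the shapes and the tanh map are assumptions:

```python
import numpy as np

def coupling_forward(x, pose, weights):
    """Additive coupling: keep the first coordinate, shift the rest by
    a pose-conditioned function of it. Exactly invertible by design.
    x: (N, 3) points, pose: (N, P) pose code, weights: (1 + P, 2)."""
    x1, x2 = x[:, :1], x[:, 1:]
    shift = np.tanh(np.concatenate([x1, pose], axis=1) @ weights)
    return np.concatenate([x1, x2 + shift], axis=1)

def coupling_inverse(y, pose, weights):
    """Inverse pass: recompute the same shift (it depends only on the
    untouched coordinate and the pose code) and subtract it."""
    y1, y2 = y[:, :1], y[:, 1:]
    shift = np.tanh(np.concatenate([y1, pose], axis=1) @ weights)
    return np.concatenate([y1, y2 - shift], axis=1)
```

Because every deformation is exactly invertible, points can be mapped between canonical and posed spaces without loss, which is what lets surface correspondences survive reposing.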

https://arxiv.org/abs/2302.09227

[LG] Provable Copyright Protection for Generative Models

N Vyas, S Kakade, B Barak
[Harvard School of Engineering and Applied Sciences]

Highlights:

  1. Near access-freeness (NAF) is proposed to quantify the degree of copyright infringement in generative models;
  2. Algorithms are developed that output models with strong bounds on the probability of sampling protected content;
  3. The algorithms incur minimal degradation in output quality;
  4. Access and similarity are decoupled in establishing copyright infringement.

One-sentence summary:
Near access-freeness (NAF) is proposed to quantify the extent to which a generative model copies protected material, together with algorithms that output models with strong bounds on the probability of sampling protected content and no significant degradation in output quality.

There is a growing concern that learned conditional generative models may output samples that are substantially similar to some copyrighted data C that was in their training set. We give a formal definition of near access-freeness (NAF) and prove bounds on the probability that a model satisfying this definition outputs a sample similar to C, even if C is included in its training set. Roughly speaking, a generative model p is k-NAF if for every potentially copyrighted data C, the output of p diverges by at most k-bits from the output of a model q that did not access C at all. We also give generative model learning algorithms, which efficiently modify the original generative model learning algorithm in a black box manner, that output generative models with strong bounds on the probability of sampling protected content. Furthermore, we provide promising experiments for both language (transformers) and image (diffusion) generative models, showing minimal degradation in output quality while ensuring strong protections against sampling protected content.
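Written out, the definition paraphrased from the abstract reads (a sketch of the formalism; $\Delta$ denotes a divergence such as the maximum divergence, $x$ the conditioning input, and $q_C$ a "safe" model trained without access to $C$):

```latex
p \text{ is } k\text{-NAF}
\quad\Longleftrightarrow\quad
\forall C:\;\;
\Delta\bigl(p(\cdot \mid x)\,\big\|\,q_C(\cdot \mid x)\bigr) \le k .
```

With the maximum divergence this directly bounds event probabilities: for any set $E$ of outputs (e.g. those substantially similar to $C$), $p(E \mid x) \le 2^k \cdot q_C(E \mid x)$, so anything the safe model is exponentially unlikely to produce remains unlikely under $p$.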

https://arxiv.org/abs/2302.10870
